In this lab we are going to work on how to estimate the background from 'real' data. Real is in air quotes because the data is actually from simplified simulations to make the problems manageable in a single lab. But the data will have some features that resemble those of real data sets.
In general exchanging raw data is a pain. Writing data in text files is error prone, inaccurate, and wastes space, but raw binary files have all sorts of subtleties too (including the internal byte order of your processor). To try and sidestep a whole set of nasty issues, we are going to use HDF5 (originally developed by the National Center for Supercomputing Applications for data exchange) to allow everyone to import the data.
Instructions for downloading the data files to your own local machine:
If you are using python on your own machine, you will need to install the h5py library. Go to your Anaconda environment (if you followed the suggestions in the first lab) and search for the h5py library. Install that library into your environment, then restart your JupyterLab or notebook session.
If you are working in the cloud python the necessary library is already installed, but you need to follow a magic incantation to download the file from the course website to your cloud instance. From the website find and copy the link address to the data file (often this means right or option-clicking on the link). Then open the terminal in your cloud instance, navigate to your working directory, and use the following command structure
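The exact link comes from your course site; a typical command structure (the URL below is a placeholder, and `curl` is shown as an alternative if `wget` is unavailable) is:

```shell
# Replace <copied-link> with the link address you copied from the course site
wget -O gammaray_lab4.h5 '<copied-link>'
# or, if wget is not installed:
curl -L -o gammaray_lab4.h5 '<copied-link>'
```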
So my command (your link might be different) for the first file is
Running ls -lh confirms this downloaded a 792 MB file named gammaray_lab4.h5 into my directory.
Now start your lab with an import block that looks like:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy
from scipy import stats
import h5py
import pandas as pd
#this sets the size of the plot to something useful
plt.rcParams["figure.figsize"] = (20,15)
Now import the file for problem 1 using
hf = h5py.File('gammaray_lab4.h5', 'r')
Within hdf5 files you can store different kinds of data sets. Ours is simple and has one called 'data'. You can look at the header and see this using
hf.keys()
<KeysViewHDF5 ['data']>
You can then import the data into an array variable using the get method
data = np.array(hf.get('data'))
Check that your data is as expected, with a time (in GPS seconds), Solar phase (deg), Earth longitude (deg), and gamma-ray counts, and more than 25 million data records. The file stores the data as 4 rows by N columns, so printing the first record you should get:
data[:,0]
array([9.40680016e+08, 3.15000000e+02, 4.50000000e+01, 1.00000000e+01])
You can then close your file
hf.close()
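An equivalent pattern that avoids forgetting hf.close() is a with-block, which closes the file automatically even if an error occurs mid-read. This sketch writes and reads back a tiny stand-in file (demo_lab4.h5 is a made-up name) rather than the real 792 MB download:

```python
import numpy as np
import h5py

# Build a one-record stand-in for the lab file's 'data' dataset
demo = np.array([[9.40680016e8, 315.0, 45.0, 10.0]])
with h5py.File('demo_lab4.h5', 'w') as hf:
    hf.create_dataset('data', data=demo)

# The file is closed automatically when the with-block exits
with h5py.File('demo_lab4.h5', 'r') as hf:
    data = np.array(hf.get('data'))  # same call pattern as above

print(data.shape)
```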
In this problem we are looking at the data from a gamma-ray satellite orbiting in low Earth orbit. It takes a reading of the number of particles detected every 100 milliseconds, and is in an approximately 90 minute orbit. While it is looking for gamma-ray bursts, virtually all of the particles detected are background cosmic rays.
As with most data, there are 'features.' The meta-data is incorporated into the data file we imported above.
The data has 4 columns and more than 25 million rows. The columns are time (in gps seconds), Solar phase (deg) showing the position of the sun relative to the orbit, Earth longitude (deg) giving the position of the spacecraft relative to the ground, and particle counts.
In order to get to know our data better, we will plot it and its metadata below.
Through a trial-and-error scavenger hunt we will look for patterns from which we can infer the properties of the data.
I will be using pandas to create a dataframe for my problems.
def plot_line_step(X, Y, start, end, step):
    fsize = 30
    fig, ax = plt.subplots(1, 1)
    plt.tick_params(labelsize = 20)
    ax.plot(X[start:end:step], Y[start:end:step], label = 'Data', linewidth = 3)
    ax.set_xlabel(f'{X.name}', fontsize = fsize)
    ax.set_ylabel(f'{Y.name}', fontsize = fsize)
    ax.set_title(f'{Y.name} vs. {X.name}', fontsize = fsize, fontweight = 'bold')
    ax.legend()
    plt.show()
def plot_points_step(X, Y, start, end, step):
    fsize = 30
    fig, ax = plt.subplots(1, 1)
    plt.tick_params(labelsize = 20)
    ax.scatter(X[start:end:step], Y[start:end:step], label = 'Data points', linewidth = 3)
    ax.set_xlabel(f'{X.name}', fontsize = fsize)
    ax.set_ylabel(f'{Y.name}', fontsize = fsize)
    ax.set_title(f'{Y.name} vs. {X.name}', fontsize = fsize, fontweight = 'bold')
    ax.legend()
    plt.show()
def plot_density(t, X, start, end, step, bins):
    fsize = 30
    fig, ax = plt.subplots(1, 1)
    plt.tick_params(labelsize = 20)
    h = ax.hist2d(t[start:end:step], X[start:end:step], bins, linewidth = 3, cmap='plasma')
    ax.set_xlabel(f'{t.name}', fontsize = fsize)
    ax.set_ylabel(f'{X.name}', fontsize = fsize)
    ax.set_title('2D Histogram', fontsize = fsize, fontweight = 'bold')
    ax.set_ylim([0, 20])
    fig.colorbar(h[3], ax=ax)
    plt.show()
def plot_fold_density(t, X, bins, fold):
    fsize = 30
    t_mod = t % fold  # fold the time axis onto a single period
    fig, ax = plt.subplots(1, 1)
    plt.tick_params(labelsize = 20)
    h = ax.hist2d(t_mod, X, bins, linewidth = 3, cmap='plasma')
    ax.set_xlabel(f'{t.name}', fontsize = fsize)
    ax.set_ylabel(f'{X.name}', fontsize = fsize)
    ax.set_title('Folded Data (2D Histogram)', fontsize = fsize, fontweight = 'bold')
    fig.colorbar(h[3], ax=ax)
    plt.show()
We use pandas to simplify our data manipulation process.
df = pd.DataFrame(data).T
df.columns = ['GPS Time (s)', 'Solar phase (deg)', 'Earth longitude (deg)', 'Particle counts']
We first look at how the different data columns vary with GPS Time.
plot_points_step(df['GPS Time (s)'], df['Particle counts'], 0, df['GPS Time (s)'].size, 100_000)
plot_points_step(df['GPS Time (s)'], df['Solar phase (deg)'], 0, df['GPS Time (s)'].size, 100_000)
plot_points_step(df['GPS Time (s)'], df['Earth longitude (deg)'], 0, df['GPS Time (s)'].size, 100_000)
Then we can look at how the particle counts vary for the different data columns.
plot_points_step(df['Earth longitude (deg)'], df['Particle counts'], 0, df['Earth longitude (deg)'].size, 10_000)
plot_points_step(df['Solar phase (deg)'], df['Particle counts'], 0, df['Solar phase (deg)'].size, 10_000)
Let's find out whether these counts follow a known probability distribution.
X = df['Particle counts']
fsize = 30
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(X[0:100_000], bins = 300, label = 'Data', linewidth = 3, density = True)
ax.set_xlabel(f'{X.name}', fontsize = fsize)
ax.set_ylabel('Probability Mass', fontsize = fsize)
ax.set_title(f'Probability Mass vs. {X.name}', fontsize = fsize, fontweight = 'bold')
ax.legend()
plt.show()
The background seems to follow a Poisson distribution, which is what we would expect for counts of independent events; we take note of this as we plan how to construct a $pdf()$ for the background. We check this below by plotting a Poisson distribution on top of the data.
mean1 = 6
k = np.arange(0, 26)  # Poisson counts are integers, so evaluate the pmf at integer k
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(X[0:100_000], bins = 300, label = 'Data', linewidth = 3, density = True)
# The factor of 11 roughly converts the pmf to the plotted density,
# compensating for the narrow (~1/11 wide) histogram bins
ax.plot(k, stats.poisson.pmf(k, mean1)*11, label = 'poisson', linewidth = 3)
ax.set_xlabel(f'{X.name}', fontsize = fsize)
ax.set_ylabel('Probability Mass', fontsize = fsize)
ax.set_title(f'Probability Mass vs. {X.name}', fontsize = fsize, fontweight = 'bold')
ax.legend()
plt.show()
mean1 = 6
k = np.arange(0, 26)
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(X[0:100_000], bins = 300, label = 'Data', linewidth = 3, density = True)
# Same overlay as above, now on a log scale to expose the tails
ax.plot(k, stats.poisson.pmf(k, mean1)*11, label = 'poisson', linewidth = 3)
ax.set_xlabel(f'{X.name}', fontsize = fsize)
ax.set_ylabel('Probability Mass', fontsize = fsize)
ax.set_title(f'Probability Mass vs. {X.name}', fontsize = fsize, fontweight = 'bold')
ax.set_yscale('log')
ax.legend()
plt.show()
However, overlaying our trial-and-error Poisson shows that the data are not fully described by a single Poisson distribution. We will keep this in mind in future considerations.
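One quick numerical check of the Poisson hypothesis, sketched here on simulated stand-in counts rather than the lab data: a Poisson distribution has equal mean and variance, while counts whose underlying rate drifts (as ours turn out to, below) are overdispersed, with variance exceeding the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(6.0, size=100_000)               # constant-rate counts
drifting = rng.poisson(np.linspace(6, 11, 100_000))   # orbit-like drifting rate
print(counts.mean(), counts.var())      # nearly equal for a true Poisson
print(drifting.mean(), drifting.var())  # variance noticeably above the mean
```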
The background is not consistent across the dataset, so we will find and describe as accurately as we can how the background changes.
When we plot the particle counts against GPS time, the background appears to vary periodically as we look at it more closely.
plot_line_step(df['GPS Time (s)'], df['Particle counts'], 0, df['GPS Time (s)'].size, 10_000)
There is a faint periodic signal here, so let's try folding the data: we plot it modulo a trial period, so that every time we cross that interval the plotting "restarts" at the beginning of the x-axis. The resulting plot shows the signal density not versus time but versus phase within the period. First we check whether the counts vary with the satellite's position around the Earth (its Earth longitude).
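The folding operation just described can be sketched on stand-in timestamps (the 5400 s period here is for illustration):

```python
import numpy as np

period = 5400.0
t = np.arange(0.0, 3 * period, 100.0)  # three fake orbits of timestamps
phase = t % period                      # time within the current orbit
print(phase.min(), phase.max())
```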
t = df['Earth longitude (deg)']
X = df['Particle counts']
bins = 30
start = 0
end = X.size
step = 1
plot_density(t, X, start, end, step, bins)
It is clear that the count rate varies with Earth longitude. Since the background follows a Poisson distribution, the mean (and peak, the brightest yellow in the 2D histogram) of the distribution varies as the satellite orbits the Earth. The pattern repeats every orbit (360 degrees of Earth longitude).
Moreover, the mean of this Poisson distribution seems to vary with exponential decay.
Thus we have a Poisson distribution whose mean depends on time. We will find the mean of each vertical bin, and to describe how that mean changes over time $\rightarrow$ we will use longitude (which is itself a function of time).
Let's create a model for the background that includes time dependence and explicitly compare our model to the data. How good will our model of the background be?
We attempt applying the information we found above.
start,end,step = (0,300000,1)
X = df['GPS Time (s)']
Y = df['Earth longitude (deg)']
fsize = 30
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.plot(X[start:end:step], Y[start:end:step], label = 'Data points', linewidth = 3)
ax.set_xlabel(f'{X.name}', fontsize = fsize)
ax.set_ylabel(f'{Y.name}', fontsize = fsize)
ax.set_title(f'{Y.name} vs. {X.name}', fontsize = fsize, fontweight = 'bold')
ax.legend()
plt.show()
As we can see, the Earth Longitude varies periodically with a set time interval in GPS Time. Let's find the period in GPS Time by looking at when the Earth Longitude passes the 0 Meridian.
meridian = 0
df[df['Earth longitude (deg)']==meridian][:10].index
Int64Index([47250, 101250, 155250, 209250, 263250, 317250, 371250, 425250,
479250, 533250],
dtype='int64')
a = df[df['Earth longitude (deg)']==meridian][:10].index[0]
b = df[df['Earth longitude (deg)']==meridian][:10].index[1]
a,b
(47250, 101250)
orbit_interval = df['GPS Time (s)'][b] - df['GPS Time (s)'][a]
df['GPS Time (s)'][b], df['GPS Time (s)'][a], orbit_interval
(940690141.0, 940684741.0, 5400.0)
print(f'Thus the satellite takes {orbit_interval} s to complete one orbit of the Earth.')
Thus the satellite takes 5400.0 s to complete one orbit of the Earth.
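The same period can also be recovered programmatically from the wrap-arounds of the longitude; this sketch uses a simulated longitude ramp (360 degrees every 5400 s, sampled at 10 Hz like the lab data) rather than the lab file:

```python
import numpy as np

dt = 0.1
t = np.arange(0.0, 20000.0, dt)
lon = (t / 5400.0 * 360.0) % 360.0      # longitude ramps up, then wraps to 0
wraps = np.where(np.diff(lon) < 0)[0]   # indices where longitude jumps down
periods = np.diff(t[wraps])             # spacing between wrap-arounds
print(periods)
```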
seconds_from_start = (df['GPS Time (s)'] - df['GPS Time (s)'].iloc[0])
seconds_from_start
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
25919996 2591999.6
25919997 2591999.7
25919998 2591999.8
25919999 2591999.9
25920000 2592000.0
Name: GPS Time (s), Length: 25920001, dtype: float64
seconds_from_start[a], seconds_from_start[b]
(4725.0, 10125.0)
It seems the mean varies between particle counts of roughly 11 and 6 over the 5400 s orbit (with a phase offset of 4725.0 s).
Now, back to our distribution. It needs to meet these "boundary conditions".
We model the mean as $11 \cdot p^t$ and impose the boundary conditions
$$ 11 \cdot p^0 = 11 $$
$$ 11 \cdot p^{5400} = 6 $$
Solving the second equation for $p$:
p = np.exp(np.log(6/11)/5400)
p
0.9998877589284689
t1 = 0
t2 = 5400
11*(p**(t1)), 11*(p**(t2))
(11.0, 6.000000000000629)
Thus, our background pdf is
shift = -4050 # We estimate this phase shift to be about 4050 s by eye
t = np.arange(0 + shift, 5400 + shift) % 5400
mean_array = 11*(p**(t))
background = stats.poisson(mu=mean_array)
mean_array
array([9.45327575, 9.45221471, 9.45115378, ..., 9.4564596 , 9.4553982 ,
9.45433692])
And the mean of the Poisson distribution follows this exponential decay:
t = df['GPS Time (s)']
X = df['Particle counts']
fold = 5400
bins = 100
t_mod = t % fold
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
h = ax.hist2d(t_mod, X, bins, linewidth = 3, cmap='plasma')
ax.set_xlabel(f'{t.name}', fontsize = fsize)
ax.set_ylabel(f'{X.name}', fontsize = fsize)
ax.set_title('Folded Data (2D Histogram)', fontsize = fsize, fontweight = 'bold')
fig.colorbar(h[3], ax=ax)
ax.plot(mean_array, c='r')
plt.show()
frac = 45/360  # 45 degrees of longitude corresponds to this fraction of the 5400 s orbit
5400 * frac
675.0
x = np.linspace(0,5400)
mean_array = 11*(p**(x))
background = stats.poisson(mu=mean_array)
t = df['GPS Time (s)'] + shift
X = df['Particle counts']
fold = 5400
bins = 30
start = 0
end = X.size
step = 1
t_mod = t % fold
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
h = ax.hist2d(t_mod, X, bins, linewidth = 3, cmap='plasma')
ax.set_xlabel(f'{t.name}', fontsize = fsize)
ax.set_ylabel(f'{X.name}', fontsize = fsize)
ax.set_title('Folded Data (2D Histogram)', fontsize = fsize, fontweight = 'bold')
fig.colorbar(h[3], ax=ax)
ax.plot(x, mean_array, c='r')
plt.show()
As we can see, the model is not perfect and could be improved by adjusting the shape of the exponential decay.
We remind ourselves that the mean of our Poisson distribution is changing with the Earth Longitude and thus also with GPS time. It takes 5400 GPS seconds to complete one orbit, and there is some phase shift. Thus, we will use a $t'$ as our independent variable.
We can try a natural exponential instead and apply some 'floor' to this function (since the mean seems to follow an exponential until it hits a floor):
If $\lambda$ is our Poisson distribution's mean, we have $ P(k; \lambda(t')) = \frac{\lambda(t')^{k} e^{-\lambda(t')}}{k!} $
and the mean is given by $ \lambda (t') = a e^{-b t'} + c $. We fit this mean to the 2D histogram above by finding suitable $a$, $b$, and $c$; here $c$ is the floor.
# These parameters could also be found with a fitting program, but here they were chosen by educated guessing
a = 5.5
b = 0.0008
c = 5.5
t_prime = np.linspace(0,5400)
mean_array = a * np.exp(-b*(t_prime)) + c
background = stats.poisson(mu=mean_array)
t = df['GPS Time (s)'] + shift
X = df['Particle counts']
fold = 5400
bins = 30
start = 0
end = X.size
step = 1
t_mod = t % fold
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
h = ax.hist2d(t_mod, X, bins, linewidth = 3, cmap='plasma')
ax.set_xlabel(f'{t.name}', fontsize = fsize)
ax.set_ylabel(f'{X.name}', fontsize = fsize)
ax.set_title('Folded Data (2D Histogram)', fontsize = fsize, fontweight = 'bold')
fig.colorbar(h[3], ax=ax)
ax.plot(t_prime, mean_array, c='r')
plt.show()
This background model is a much better fit, tracking the most probable value of the background as it changes over time.
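As noted in the code comment above, the parameters could also come from a fitting program. A minimal sketch with scipy.optimize.curve_fit, run on simulated per-bin means rather than the lab data:

```python
import numpy as np
from scipy.optimize import curve_fit

# The mean model lambda(t') = a*exp(-b*t') + c from the text
def decay(tp, a, b, c):
    return a * np.exp(-b * tp) + c

# Simulated noisy per-bin means standing in for the folded-histogram means
rng = np.random.default_rng(1)
t_prime = np.linspace(0.0, 5400.0, 50)
noisy_means = decay(t_prime, 5.5, 0.0008, 5.5) + rng.normal(0.0, 0.05, t_prime.size)

# p0 is a rough starting guess for (a, b, c)
popt, pcov = curve_fit(decay, t_prime, noisy_means, p0=(5.0, 0.001, 5.0))
print(popt)
```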
Because the background varies, our discovery sensitivity threshold (how many particles you would need to see) also varies. What is the '5-sigma' threshold for a 100 millisecond GRB at different times?
sigma = 5
prob = stats.norm.sf(sigma)
prob
2.866515718791933e-07
At time $t=0$:
time1 = 0
mean1 = a * np.exp(-b*(time1)) + c
threshold1 = stats.poisson.ppf(stats.norm.cdf(5), mu=mean1)
threshold1
31.0
At time $t=3000$:
time2 = 3000
mean2 = a * np.exp(-b*(time2)) + c
threshold2 = stats.poisson.ppf(stats.norm.cdf(5), mu=mean2)
threshold2
22.0
print(f'So at time={time1}s the 5-sigma threshold is {threshold1} cosmic rays and at time={time2}s the threshold is {threshold2} cosmic rays.')
So at time=0s the 5-sigma threshold is 31.0 cosmic rays and at time=3000s the threshold is 22.0 cosmic rays.
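The same calculation can be wrapped in a loop over several times in the orbit, using the by-eye parameters and the ppf(norm.cdf(5)) conversion from above:

```python
import numpy as np
from scipy import stats

a, b, c = 5.5, 0.0008, 5.5
p_one_tailed = stats.norm.cdf(5)  # 1 minus the one-tailed 5-sigma probability
for t_prime in (0, 1000, 3000, 5000):
    mu = a * np.exp(-b * t_prime) + c
    # Smallest count whose cumulative probability reaches the 5-sigma level
    threshold = stats.poisson.ppf(p_one_tailed, mu)
    print(t_prime, threshold)  # 31.0 at t'=0 and 22.0 at t'=3000, as above
```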
This varying background may be due to the satellite being exposed periodically to the sun or the moon as it orbits the Earth.
In this problem we are going to look at a stack of telescope images (again simulated). We have 10 images, each with transients (something like a supernova that appears in only one image) and faint stars. We will be looking for transients.
We download the data from images.h5. This is a stack of 10 square images, each 200 pixels on a side.
hf = h5py.File('images.h5', 'r')
hf.keys()
<KeysViewHDF5 ['image1', 'imagestack']>
image1 = np.array(hf.get('image1'))
imagestack = np.array(hf.get('imagestack'))
hf['image1'], hf['imagestack']
(<HDF5 dataset "image1": shape (200, 200), type "<f8">, <HDF5 dataset "imagestack": shape (200, 200, 10), type "<f8">)
image1[:,0]
array([-1.24711126,  0.09727501,  0.33900351, ...,  0.5566705 ,
       -0.32054855,  1.16270664])
You can then close your file
hf.close()
Explore the data. Is there signal contamination? Is the background time dependent? Is it consistent spatially? Develop a plan to calculate your background pdf().
We then explore the data to see whether there is signal contamination, whether the background is time dependent, and whether it is consistent spatially.
We will eventually calculate a background $pdf()$.
print(image1.shape)
(200, 200)
plt.imshow(image1, cmap='Greys')
plt.colorbar()
We can access each image in the imagestack like this, looping over all 10:
plt.rcParams["figure.figsize"] = (6,6)
for imagenumber in range(0,10):
    plt.imshow(imagestack[:,:,imagenumber], cmap='Greys')
    plt.colorbar()
    plt.show()
On top of the random background noise, the faint stars are clearly visible as light gray dots alongside the darker black dots of the brighter sources.
plt.plot(image1);
for imagenumber in range(0,10):
    plt.plot(imagestack[:,:,imagenumber]);
plt.show()
Again, we can easily see that some of the images have clear signal detections (transients) appearing as spikes, while others show only background noise. So what does our noise profile look like?
We're looking at intensity vs. counts of pixels with that intensity $\rightarrow$ Essentially the distribution of intensity over the CCD.
We try using the imagestack, flattening the first image into a 1-D array of pixel values.
flat_list = imagestack[:, :, 0].flatten()  # all 40,000 pixels of the first image
min(flat_list), max(flat_list)
(-1.6994599247013218, 1.9955820607564105)
bins = np.linspace(np.floor(min(flat_list)), np.ceil(max(flat_list)+1), 40)
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(flat_list, bins, linewidth = 3)
ax.set_xlabel(f'Intensity', fontsize = fsize)
ax.set_ylabel(f'Number of pixels', fontsize = fsize)
ax.set_title('CCD intensity vs. number of pixels', fontsize = fsize, fontweight = 'bold')
ax.set_yscale('log')
plt.show()
background = stats.norm(loc=0, scale=0.5)
x = np.linspace(-2, 2, 1000)
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(flat_list, bins, linewidth = 3)
ax.plot(x, background.pdf(x)*2.7e2, linewidth = 3, label = 'gaussian')  # scaled by hand to match the un-normalized histogram
ax.set_xlabel(f'Intensity', fontsize = fsize)
ax.set_ylabel(f'Number of pixels', fontsize = fsize)
ax.set_title('CCD intensity vs. number of pixels', fontsize = fsize, fontweight = 'bold')
ax.set_yscale('log')
ax.set_ylim([1e-1, 3e4])
ax.legend()
plt.show()
We can see that the images have spatially uniform background noise that follows a Gaussian distribution. By computing the mean and standard deviation of the data within an interval where the background is well represented by a Gaussian, we can characterize the background as a Gaussian with that mean and standard deviation. Using this approach, we can then find the significance of a transient signal (such as a bright supernova).
image1 = imagestack[:, :, 0].flatten()
rel1 = image1[image1 < 2]
mean1 = np.mean(rel1)
std1 = np.std(rel1)
x = np.linspace(-5, 5, 1000)
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(image1, bins=1000, density=True)
ax.plot(x, stats.norm(mean1, std1).pdf(x), linewidth=5, alpha=0.6, label='gaussian')
ax.set_xlabel('Data', fontsize=fsize)
ax.set_ylabel('Probability Density', fontsize=fsize)
ax.set_title('Image 1 distribution', fontsize=fsize, fontweight='bold')
ax.set_ylim([1e-5, None])
ax.set_yscale('log')
ax.legend()
plt.show()
print(f'Mean={abs(mean1):.2f}, Standard Deviation={std1:.2f}');
Mean=0.00, Standard Deviation=0.56
Having demonstrated our method for finding the Gaussian background above, we can include all 10 images in the imagestack.
image = imagestack[:, :, :].flatten()
rel = image[image < 2]
mean = np.mean(rel)
std = np.std(rel)
x = np.linspace(-5, 5, 1000)
fig, ax = plt.subplots(1, 1)
plt.tick_params(labelsize = 20)
ax.hist(image, bins=1000, density=True)
ax.plot(x, stats.norm(mean, std).pdf(x), linewidth=5, alpha=0.6, label='gaussian')
ax.set_xlabel('Data', fontsize=fsize)
ax.set_ylabel('Probability Density', fontsize=fsize)
ax.set_title('All images in imagestack', fontsize=fsize, fontweight='bold')
ax.set_ylim([1e-5, None])
ax.set_yscale('log')
ax.legend()
print(f'Mean={abs(mean):.2f}, Standard Deviation={std:.2f}');
Mean=0.00, Standard Deviation=0.56
We note that the standard deviation and mean of the Gaussian background did not change when including all images.
print(f'Thus, our background is a Gaussian with a mean of {abs(mean):.2f} and a standard deviation of {std:.2f}.')
Thus, our background is a Gaussian with a mean of 0.00 and a standard deviation of 0.56.
So let's use this background distribution to hunt for a signal (a transient). We will look for a candidate with one of the higher intensities, for example one with an intensity value of ~10, which shows up as a detection in the plots above and lies far outside the background distribution. Since we have our background, we can determine significances.
We have a background that we will use to detect the significance of an event for a single image. We will not average over several images since we are looking for an event happening in one of the images that is not generally happening in the other images.
background = stats.norm(loc=mean, scale=std)
Y = 10
prob = background.sf(Y)
sigma = abs(stats.norm.ppf(prob))
prob, sigma
(1.894376003509495e-71, 17.83487808130949)
print(f'Thus, the sigma of a signal Y = {Y} is {sigma:.1f}.')
Thus, the sigma of a signal Y = 10 is 17.8.
This is a very significant signal: its sigma value is so large that it is extremely unlikely the background produced it.
My lab partner (who chose to look for faint stars instead of transients) had different pdf()s, but was using the same data.
When looking for faint stars, we can average all the images, since a faint star stays in the same place from one image of that patch of sky to the next. Thus we are justified in averaging the backgrounds of the images.
Since the standard deviation of an averaged background is smaller (the distribution is narrower) than that of our single-image 'transient background', we expect a lower discovery threshold for a faint star than for a transient. This makes sense: a faint star requires less intensity in any one image, because the other images add extra 'confidence' to the measurement, effectively helping to confirm the star's existence.
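The $\sqrt{N}$ narrowing my partner relied on can be demonstrated on simulated noise-only images (the 0.56 matches our estimated background width, but the stack itself is synthetic):

```python
import numpy as np

# Ten 200x200 noise-only images with the estimated background width
rng = np.random.default_rng(2)
stack = rng.normal(0.0, 0.56, size=(200, 200, 10))

single_std = stack[:, :, 0].std()        # width of one image's background
stacked_std = stack.mean(axis=2).std()   # width after averaging all 10
# Averaging N images narrows the Gaussian by a factor of sqrt(N)
print(single_std, stacked_std, 0.56 / np.sqrt(10))
```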